Assignment 1

I’m here to learn new skills and improve old ones. The course fits my needs perfectly. I have used R a lot, but I haven’t shared my code outside of our research group, so learning good practices is very welcome. I went through the Exercise 1 ‘warm up’ pretty fast and everything was quite familiar. As a note to myself: the syntax and the functions used were in many ways different from what I am used to for reaching the same goal. It might be good to start using fewer ‘intermediate’ objects and more ‘tidyverse’ style. I like to use Atom for writing scripts, so it will be nice to try to do things differently.

My github: https://github.com/jvehvi/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec  7 16:56:51 2022"

Assignment 2 - Regression and model validation

This week’s topic was regression and model validation. Here we present basic R commands for building summary tables and linear models, which can be used for studying relationships (associations, not necessarily cause-and-effect) between different variables.

Data wrangling

PART 1

# Set up working dir
setwd("E:/Open science course 2022/IODS-project") 
dir.create(paste0(getwd(),"/Data"))
## Warning in dir.create(paste0(getwd(), "/Data")): 'E:\Open science course
## 2022\IODS-project\Data' already exists
setwd(paste0(getwd(),"/Data"))

PART 2

Bring the .txt file, which has been downloaded to the Data folder, into the R environment.

The table should have 183 rows and 60 columns, containing integer values in all except the last column, which contains information about gender as a character variable.
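A minimal sketch of the reading step (the file name and separator are my assumptions, based on the course material):

# read the survey data; the file is tab-separated with a header row
learning_raw <- read.table("Data/JYTOPKYS3-data.txt", sep = "\t", header = TRUE)
dim(learning_raw)          # should be 183 rows and 60 columns
str(learning_raw$gender)   # the only character column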

PART 3

  • Subset the data set to contain the columns: gender, age, attitude, deep, stra, surf and points.
  • Also remove rows which have 0 in Points.
  • Deep is calculated by taking the mean of the columns c(“D03”,“D11”,“D19”,“D27”,“D07”,“D14”,“D22”,“D30”,“D06”,“D15”,“D23”,“D31”) and excluding 0.
  • Stra is calculated by taking the mean of the columns c(“ST01”,“ST09”,“ST17”,“ST25”,“ST04”,“ST12”,“ST20”,“ST28”) and excluding 0.
  • Surf is calculated by taking the mean of the columns c(“SU02”,“SU10”,“SU18”,“SU26”,“SU05”,“SU13”,“SU21”,“SU29”,“SU08”,“SU16”,“SU24”,“SU32”) and excluding 0.
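A sketch of this wrangling in tidyverse style, assuming the raw table from PART 2 is named learning_raw; the capitalized column names follow the summary output in the Analysis part, and only the rows with Points == 0 are dropped, since the question items themselves should not contain zeros:

library(dplyr)

# item groups behind each composite score
deep_cols <- c("D03","D11","D19","D27","D07","D14","D22","D30","D06","D15","D23","D31")
stra_cols <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
surf_cols <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")

learning_2014 <- learning_raw %>%
  mutate(Deep = rowMeans(across(all_of(deep_cols))),   # row-wise mean of deep items
         Stra = rowMeans(across(all_of(stra_cols))),
         Surf = rowMeans(across(all_of(surf_cols)))) %>%
  select(Gender = gender, Age, Attitude, Deep, Stra, Surf, Points) %>%
  filter(Points > 0)   # remove rows which have 0 in Points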

PART 4
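Finally, the wrangled data can be written to the Data folder so that the analysis part can read it back; a sketch (the file name is my choice):

library(readr)
write_csv(learning_2014, "Data/learning_2014.csv")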

Analysis

PART 1 - Read file to R environment

The data set contains summary results of the survey from the course ‘Johdatus yhteiskuntatilastotieteeseen, syksy 2014’. There should be 7 variables (gender, age, attitude, deep, stra, surf and points) and 166 observations.

  • Gender: Male = 1, Female = 2
  • Age: Age (in years) derived from the date of birth
  • Attitude: Global attitude toward statistics. Mean of the original variables (~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)
  • Deep: Deep approach. Mean of the original variables (~d_sm+d_ri+d_ue)
  • Stra: Strategic approach. Mean of the original variables (~st_os+st_tm)
  • Surf: Surface approach. Mean of the original variables (~su_lp+su_um+su_sb)
  • Points: Total points from the survey

More information about the variables can be found from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
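A sketch of the reading step (the file name follows my wrangling sketch above; the data frame name matches the lm() calls later):

# read the wrangled data back into the R environment
learning_2014.2 <- read.csv("Data/learning_2014.csv")
dim(learning_2014.2)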

## [1] 166   8

PART 2

With the command summary() we can take a look ‘inside’ the data: it gives ‘boxplot information’ (minimum, quartiles, median, mean, maximum) for the numerical variables in the data frame. A scatterplot matrix is used to describe relationships between the variables. It is constructed from the data frame with the ggpairs() function (GGally package, built on ggplot2). In addition to the pairwise relationships, the resulting plot shows each variable’s distribution and gives correlation coefficients, with asterisks showing the level of significance.
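A sketch of the commands behind the output below; the ggpairs() arguments follow the ones used later in this diary:

library(ggplot2)
library(GGally)

# 'boxplot information' for each variable
summary(learning_2014.2)

# scatterplot matrix: distributions on the diagonal, correlations with
# significance stars in the upper triangle
p <- ggpairs(learning_2014.2, lower = list(combo = wrap("facethist", bins = 20)))
p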

## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
##     Gender               Age           Attitude          Deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.375  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.250  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.750  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.645  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.000  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.875  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The most promising relationships seem to be between Attitude and Points, and between Surf and Deep.
There also seems to be some kind of relationship between Stra and Surf. Overall, the matrix gives a good overview of the data and a starting point for studying the relationships between the variables further.

PART 3 and PART 4

Create a regression model with multiple explanatory variables
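A sketch of the fit behind the output (the object name my_model is my choice; the formula and data frame match the Call below):

# regress exam points on attitude and the two learning-approach scores
my_model <- lm(Points ~ Attitude + Stra + Surf, data = learning_2014.2)
summary(my_model)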

## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## Stra          0.8531     0.5416   1.575  0.11716    
## Surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary of the regression model shows that only Attitude seems to correlate significantly with Points. From the printout you can see the model residuals and the coefficient table (estimate, std. error, t-value and p-value for every variable in the model against Points). The significance of each variable’s relationship can be read from the p-value (the last column of the Coefficients table); the significance level thresholds are given under the table. The summary also gives a p-value for the whole model (here 3.156e-08, clearly significant), even though the model still contains variables (Stra, Surf) that show no significant relationship to Points. In the next step, let’s remove those variables from the model and see what happens.
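A sketch of the reduced fit (again, the object name is my choice):

# keep only the clearly significant explanatory variable
my_model2 <- lm(Points ~ Attitude, data = learning_2014.2)
summary(my_model2)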

## 
## Call:
## lm(formula = Points ~ Attitude, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the results look better: the p-value is significant for the variable relationship as well as for the model. Multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variables. So in the model with three variables, 20.74 % of the variation in Points could be explained by the variables. Interestingly, Attitude by itself explains 19.06 % of the variation, showing that the effect of Stra and Surf on Points is pretty minimal.

PART 5 - Study our model further with diagnostic plots
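The diagnostic plots discussed below can be drawn straight from the fitted model object; a sketch using the reduced model from above:

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model2, which = c(1, 2, 5))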

From the ‘Residuals vs Leverage’ plot we can check which, and how many, observations are influential. In our case the data seems good: there isn’t any point outside the Cook’s distance lines.

The ‘Residuals vs Fitted’ plot also looks good: the points are spread evenly along the x- and y-axes.

The Q-Q plot also indicates the goodness of our model. If the points depart from the line too early, the residuals are not normally distributed, and there might be some other variables affecting the relationship more than the Attitude variable. In this case the Q-Q plot looks really nice, though not perfect.


Assignment 3 - Logistic regression

Bring Data to R

For this study, material from the UCI Machine Learning Repository’s Student Performance Data (incl. Alcohol consumption) page (https://archive.ics.uci.edu/ml/datasets/Student+Performance) has been used as the source data. The joined data set used in the analysis combines the two student alcohol consumption data sets from the web page. The variables not used for joining the two data sets have been combined by averaging (including the grade variables):

  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise

More information about the variables can be found from: https://archive.ics.uci.edu/ml/datasets/Student+Performance
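A sketch of how the two derived variables can be computed, assuming a joined data frame (here hypothetically named joined) that already contains ‘Dalc’ and ‘Walc’:

library(dplyr)

joined <- joined %>%
  mutate(alc_use = (Dalc + Walc) / 2,   # average of weekday and weekend consumption
         high_use = alc_use > 2)        # TRUE when average use is higher than 2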

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.5
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Read the data table into the R environment
wrangled_student_mat_por <- as.data.frame(read_csv("Data/wrangled_student_mat_por.csv"))
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(wrangled_student_mat_por)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Select interesting variables

To study the variables’ relationships to alcohol use, I have selected four variables that I hypothesize to be related to a person’s alcohol consumption.

  • age - I think that persons over 18 use more alcohol.

  • romantic - I think that single persons drink more often.

  • studytime - I think persons who study more weekly drink less.

  • freetime - I think persons who have more free time might drink more.

library(ggplot2)
library(GGally)

# Summary table
int_variables<-c("age","romantic","studytime","freetime","alc_use","high_use")
summary(wrangled_student_mat_por[int_variables])
##       age          romantic           studytime        freetime    
##  Min.   :15.00   Length:370         Min.   :1.000   Min.   :1.000  
##  1st Qu.:16.00   Class :character   1st Qu.:1.000   1st Qu.:3.000  
##  Median :17.00   Mode  :character   Median :2.000   Median :3.000  
##  Mean   :16.58                      Mean   :2.043   Mean   :3.224  
##  3rd Qu.:17.00                      3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :22.00                      Max.   :4.000   Max.   :5.000  
##     alc_use       high_use      
##  Min.   :1.000   Mode :logical  
##  1st Qu.:1.000   FALSE:259      
##  Median :1.500   TRUE :111      
##  Mean   :1.889                  
##  3rd Qu.:2.500                  
##  Max.   :5.000
# Unique values for romantic column
unique(wrangled_student_mat_por$romantic)
## [1] "no"  "yes"
# Group by romantic before summarise
wrangled_student_mat_por %>% group_by(romantic) %>% summarise(count = n())
## # A tibble: 2 × 2
##   romantic count
##   <chr>    <int>
## 1 no         251
## 2 yes        119
# Scatterplot
p <- ggpairs(wrangled_student_mat_por[int_variables], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 
p

The results show that there seems to be some kind of relationship between alcohol use and at least freetime, studytime and age. A positive correlation can be seen between alcohol use and freetime, and negative correlations between alcohol use and studytime, and between alcohol use and age. An interesting finding is that young persons (15-18) seem to drink more than those aged 19-21.

Logistic regression

Use glm() to fit a logistic regression model with high_use as the target variable and freetime, studytime, age and romantic as the predictors.

m <- glm(high_use ~ freetime + studytime + age + factor(romantic), data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age + factor(romantic), 
##     family = "binomial", data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6628  -0.8579  -0.6654   1.1899   2.3332  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.4470     1.7762  -2.504 0.012293 *  
## freetime              0.3269     0.1238   2.641 0.008274 ** 
## studytime            -0.5850     0.1572  -3.721 0.000198 ***
## age                   0.2246     0.1025   2.190 0.028512 *  
## factor(romantic)yes  -0.2158     0.2594  -0.832 0.405404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 421.66  on 365  degrees of freedom
## AIC: 431.66
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                             OR        2.5 %    97.5 %
## (Intercept)         0.01171392 0.0003399901 0.3664590
## freetime            1.38664482 1.0914115708 1.7751196
## studytime           0.55710460 0.4053055149 0.7516885
## age                 1.25176377 1.0261046822 1.5355541
## factor(romantic)yes 0.80586877 0.4806650258 1.3321754

The results show that the widest range is for the variable ‘romantic’. The range also contains 1, so most probably alcohol consumption and romantic status have no relationship; rather, the variables are independent of each other. Freetime and age have an OR and a confidence interval > 1, so there seems to be a positive correlation between them and high_use. The studytime values are < 1, meaning a negative correlation between it and high_use.

Predictive power of the model

Next, I have built a new model excluding ‘romantic’, which seems to have no relationship to students’ alcohol use. After building the model, I have printed its summary and calculated the mean prediction error.

library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Let's select only freetime, studytime and age for the next model
m <- glm(high_use ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age, family = "binomial", 
##     data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
# Making predictions on the train set.
# Probability
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, probability = predict(m, type = "response"))
# Prediction
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, prediction = probability > 0.5)
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = wrangled_student_mat_por$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# Making predictions on the 'test' set (here the same data, since no separate hold-out set was created).
test_pred<- ifelse(predict(m, newdata = wrangled_student_mat_por, type = "response") > 0.5, "TRUE", "FALSE")
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# We can study the goodness of our model further by looking at the confusion matrix of the results
confusionMatrix(table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred))
## Confusion Matrix and Statistics
## 
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
##                                           
##                Accuracy : 0.7027          
##                  95% CI : (0.6533, 0.7488)
##     No Information Rate : 0.9216          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1028          
##                                           
##  Mcnemar's Test P-Value : 1.136e-14       
##                                           
##             Sensitivity : 0.7185          
##             Specificity : 0.5172          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.1351          
##              Prevalence : 0.9216          
##          Detection Rate : 0.6622          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.6179          
##                                           
##        'Positive' Class : FALSE           
## 
# Prediction plot
# access dplyr and ggplot2
library(dplyr); library(ggplot2)

# initialize a plot of 'high_use' versus 'probability' in 'wrangled_student_mat_por'
# define the geom as points and draw the plot
g <- ggplot(wrangled_student_mat_por, aes(x = high_use, y = as.numeric(probability))) +
    geom_point(alpha = 0.2)
g

# ROC curve
library('pROC')
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
test_prob <- predict(m, newdata = wrangled_student_mat_por, type = "response")
test_roc <- roc(wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases

test_roc
## 
## Call:
## roc.formula(formula = wrangled_student_mat_por$high_use ~ test_prob,     plot = TRUE, print.auc = TRUE)
## 
## Data: test_prob in 259 controls (wrangled_student_mat_por$high_use FALSE) < 111 cases (wrangled_student_mat_por$high_use TRUE).
## Area under the curve: 0.6808
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}

# call calc_class_err to compute the average number of wrong predictions in the (training) data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = wrangled_student_mat_por$prediction)
## [1] 0.2972973
# call calc_class_err to compute the average number of wrong predictions in the (test) data
# Error rates should be near each other for the training and testing data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = test_pred)
## [1] 0.2972973

Our results show that approximately 30 % of our predictions are wrong, so the accuracy of our model is about 0.70, which is at least better than a pure guess. We selected 0.5 as the classification threshold; that value is our choice, and how well it serves accuracy can be studied from the ROC curve (AUC is about 0.68 here, which is okay for this study). The test error rate is the same as the training error rate, which is expected here because the ‘test’ predictions were made on the same data the model was trained on; a separate hold-out set would be needed to properly assess generalization. When the predictions are compared to high_use, we can see that our predictions contain too many low users (alc_use ≤ 2) compared to the real situation and, conversely, too few high users.

10-fold cross-validation

Let’s perform 10-fold cross-validation on our model to get a better estimate of test set performance. 10-fold means that all our observations are split into 10 folds; each fold is used once for validation while the model is trained on the remaining nine.

library(caret)

#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)

#fit a regression model and use k-fold CV to evaluate performance
model <- train(factor(high_use) ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial", method = "glm", trControl = ctrl)
model
## Generalized Linear Model 
## 
## 370 samples
##   3 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 333, 333, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7054054  0.1198191
# Show also summary of model
summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
#View final model parameters
model$finalModel
## 
## Call:  NULL
## 
## Coefficients:
## (Intercept)     freetime    studytime          age  
##     -4.2829       0.3246      -0.5928       0.2119  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  366 Residual
## Null Deviance:       452 
## Residual Deviance: 422.4     AIC: 430.4

The results show that our accuracy is approximately 0.70, meaning that 30 % of our predictions are wrong. Actually, the results are almost the same as with the logistic regression model above, so in this case there isn’t much point in using 10-fold cross-validation.

Typically, as k gets larger, the difference in size between the training set and the resampling subsets gets smaller, and the bias of the technique becomes smaller; however, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically k = 10 or k = 5 is used, because these have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

According to the summary table of the model, studytime has the strongest negative association with alcohol use, freetime has the strongest positive association with high_use, and the age variable also has a significant effect on drinking habits. Basically, these results seem to be in line with the earlier results.

Perform cross-validation to compare the performance of different logistic regression models

To test and compare the performance of different models, I have selected 10 variables as a starting point and removed ‘the worst’ one by one until three variables were left. The selected variables are:

  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

  • health - current health status (numeric: from 1 - very bad to 5 - very good)

  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)

  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

  • age - student’s age (numeric: from 15 to 22)

The data has been divided into training and test sets, so that the training and test error rates can be compared.

# select only the columns of interest
wrangled_student_mat_por_sub <- wrangled_student_mat_por[,colnames(wrangled_student_mat_por) %in% c("famsize","Medu","Fedu","traveltime","famrel","sex","health","freetime","studytime","age","high_use")]

# Divide data to test and training sets by using 70 % of the data for the training set
dat <- list()
dat[['index']] <- sample(nrow(wrangled_student_mat_por_sub), size = nrow(wrangled_student_mat_por_sub) * 0.7)
dat[['training']] <- wrangled_student_mat_por_sub[dat$index, ]
dat[['test']] <- wrangled_student_mat_por_sub[-dat$index,]

# Empty lists
count_of_variables <- test_mses <- training_mses <- list()
# For loop to test different counts of variables
for (i in 1:(ncol(wrangled_student_mat_por_sub)- 1)) {
    
    # Build new model every iteration containing one variable less    
    train_mod <- glm(high_use ~ ., data = dat$training[, i:ncol(dat$training)], family = "binomial")
    
    # Predict values for training data
    y_hat_training <- predict(train_mod, type = "response")
    train_predict = y_hat_training > 0.5
    # Collect training error
    training_mses[[i]] <- calc_class_err(actual = dat$training$high_use, predicted = train_predict)
    
    # Predict values for testing data
    y_hat_test <- predict(train_mod, newdata = dat$test, type = "response")
    test_predict = y_hat_test > 0.5
    test_mses[[i]] <- calc_class_err(actual = dat$test$high_use, predicted = test_predict)
    
    count_of_variables[[i]] <- length(train_mod$coefficients) - 1
}

# Build dataframe from results to be used in ggplot
df<-data.frame(Number_of_Variables=unlist(count_of_variables), Test_error=unlist(test_mses),
                Training_error=unlist(training_mses))
# Build plot
library("reshape2")
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
df_long <- melt(df, id = "Number_of_Variables")
df_long$Number_of_Variables <- factor(df_long$Number_of_Variables, levels = df$Number_of_Variables)
a <- ggplot(df_long, aes(x = Number_of_Variables, y = value, color = variable, group = variable)) +
    geom_point() +
    geom_line() +
    geom_smooth() +
    scale_x_discrete(limits = rev(levels(df_long$Number_of_Variables)))
a
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plot, the test and training errors are shown as separate lines against the number of predictor variables in the model (I selected ten variables which might have an effect on alcohol consumption). Typically, a training error clearly below the test error indicates that the model overfits the training data; the model can be considered good enough when the test and training errors are near each other. For example, from the plot we can see that the training error first rises in the range 10-8 while the test error stays lower and remains almost unchanged. Overfitting can be seen in the ranges 7-5 and 3-1. When we have four variables in the model, the test and training errors are near each other, indicating a reasonable model. These results seem to change depending on how the data is divided into test and training sets. According to these results, a logistic regression model with four variables seems good, provided the selected variables are good. Different subsets of variables should also be tried, to find out which four variables would be the best ones.


Assignment 4: Clustering and classification

This assignment’s topic is clustering and classification. In essence, clustering means that some points/observations in the data lie so near each other, compared to the other points/observations, that they form their own cluster. There are multiple clustering methods for finding such clusters from the data. One of the most common is k-means clustering; others worth naming are the hierarchical clustering methods, which give dendrograms as output. If the clustering succeeds, new observations can then be classified into those clusters.

To clarify this topic, the Boston dataset from the MASS package will be used as a demonstration dataset.
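As a minimal sketch of the k-means idea mentioned above, run on the (standardized) Boston data; the number of clusters is my arbitrary choice here:

library(MASS)                          # Boston data

# standardize so that all variables contribute comparably to the distances
km_input <- scale(Boston)

set.seed(1)                            # fix the random initialization
km <- kmeans(km_input, centers = 3)    # assign each observation to one of 3 clusters
table(km$cluster)                      # how many observations fell in each cluster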

The Boston Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The Boston data frame has 506 rows and 14 columns. The following describes the dataset columns (from the MASS documentation):

  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • nox - nitrogen oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted mean of distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
  • lstat - lower status of the population (percent)
  • medv - median value of owner-occupied homes in $1000s

Summary and graphical overview of the data

# Bring needed libraries to R-environment
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(GGally)
library(reshape2)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# Check head of Boston table
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Check dimensions
dim(Boston)
## [1] 506  14
# Check summary of data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# Visualize data for inspecting
## Scatterplot
ggpairs(Boston, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 

## All variables against crim rate
bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
  facet_wrap(~variable, scales="free")+
  geom_point()

# Correlation matrix and visualization
## Matrix
corrmatrix <- cor(Boston)
corrmatrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
## Visualization
corrplot(corrmatrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, type = "upper")

There are 506 observations of 14 variables, which summarize information concerning housing in the area of Boston, MA. Using this dataset, we can try to find relationships between the variables. For example, the relationship of the per capita crime rate to the other variables is very interesting: if we could find the variables with the strongest effect on it, we could in the future try to influence those variables and thereby reduce the development of the crime rate when planning new residential areas.

From the summary table we can see that not all variables are easy or wise to use as predictors in the same model as they are now. For example, the observations of the “proportion of residential land zoned for lots over 25,000 sq.ft.” (zn) variable are unevenly distributed, whereas the observations of the “pupil-teacher ratio by town” (ptratio) variable are much more evenly distributed. Interestingly, when we plot the other variables against the crime rate, we can see possible correlations even where the distributions are not so clear.

From the correlation matrix we can see that each variable has some kind of positive or negative correlation with the per capita crime rate: six variables correlate negatively and seven positively. The weakest negative correlation is with the “Charles River dummy variable (1 if tract bounds river; 0 otherwise)” (chas), and overall the impact of the negatively correlated factors seems low. The impact of the positively correlated factors seems much higher: the “index of accessibility to radial highways” (rad) and the “full-value property-tax rate per $10,000” (tax) variables have the strongest positive correlations with the per capita crime rate. The same results are visualized in the correlation plot, where conclusions can be drawn more easily.

Standardize the dataset

To get a clearer understanding of the data and to make it usable in a prediction model, it has to be standardized. To do that, we scale and center the columns of the numeric matrix: centering is done by subtracting the column means (omitting NAs) from the corresponding columns, and scaling is done by dividing the (centered) columns by their standard deviations. In other words, each column x is transformed as (x - mean(x)) / sd(x).

# Check head of original data to compare scaled data
data("Boston")
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# center and standardize variables, also set output to be data.frame
boston_scaled <- as.data.frame(scale(Boston, center = TRUE, scale = TRUE))
head(boston_scaled)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
# Set 'crim' to numeric variable
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# Summary of crim variable
summary(boston_scaled[,'crim'])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# class of the boston_scaled object
class(boston_scaled)
## [1] "data.frame"
# Inspect how the values have changed
# Check summary of data
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Visualize data for inspecting
## Scatterplot
ggpairs(boston_scaled, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 

## All variables against crim rate
bosmelt <- melt(boston_scaled, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
  facet_wrap(~variable, scales="free")+
  geom_point()

# Correlation matrix and visualization
## Matrix
corrmatrix <- cor(boston_scaled)
corrmatrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
## Visualization
corrplot(corrmatrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, type = "upper")

After scaling we can see that values across different variables are much more near each others. From the correlation matrix, we see that we got same results as with raw values.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled[['crim']])
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime' from the quantile breaks
crime <- cut(boston_scaled[['crim']], breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# look at the head of the scaled data
head(boston_scaled)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# Set 'crime' to factor variable, and give relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

Divide the dataset into train and test sets

  • Data will be divided 80/20 to train and test sets
# Bring Boston data table from github to R

boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)

# Change 'crime' to factor variable with relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

# Build up test and train datasets
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]
dim(train)
## [1] 404  14
# create test set 
test <- boston_scaled[-ind,]
dim(test)
## [1] 102  14
# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
head(test)
##             zn      indus       chas        nox         rm         age
## 6  -0.48724019 -1.3055857 -0.2723291 -0.8344581  0.2068916 -0.35080997
## 12  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3922967  0.50890509
## 20 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.7936533  0.03286452
## 23 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044  0.82152875
## 27 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.6712537  0.77179316
## 36 -0.48724019 -0.7545936 -0.2723291 -0.4806367 -0.5004637 -0.01331852
##              dis        rad        tax    ptratio     black       lstat
## 6   1.0766711351 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.04229087
## 12  1.1547920492 -0.5224844 -0.5769480 -1.5037485 0.4406159  0.08639286
## 20  0.0006920764 -0.6373311 -0.6006817  1.1753027 0.3754425 -0.19227719
## 23  0.0863638874 -0.6373311 -0.6006817  1.1753027 0.4406159  0.84958472
## 27  0.4212152950 -0.6373311 -0.6006817  1.1753027 0.2213265  0.30204708
## 36 -0.2064589434 -0.5224844 -0.7668172  0.3438730 0.4406159 -0.41633352
##          medv
## 6   0.6705582
## 12 -0.3949946
## 20 -0.4711055
## 23 -0.7972951
## 27 -0.6450733
## 36 -0.3949946

Now we have built up the test and training data by randomly selecting 80 % of the data as the training set and using the remaining 20 % as the test set. Both data sets will be used in the next part.

Fit the linear discriminant analysis on the train set

  • Crime rate as the target variable and all the other variables in the dataset as predictor variables.

Now we use the training set in linear discriminant analysis. The lda() function tries hard to detect if the within-class covariance matrix is singular: if any variable has within-group variance less than tol^2, it will stop and report the variable as constant. As a return value the function gives:

  • prior - the prior probabilities used.

  • means - the group means.

  • scaling - a matrix which transforms observations to discriminant functions, normalized so that the within-groups covariance matrix is spherical.

  • svd - the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables. Their squares are the canonical F-statistics.

  • N - the number of observations used.

  • call - the (matched) function call.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2475248 0.2500000 0.2376238 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.07176448 -0.9356148 -0.08835242 -0.8973829  0.5124648 -0.9082934
## med_low  -0.04174674 -0.3192808 -0.07547406 -0.6032746 -0.1042309 -0.3696740
## med_high -0.37686340  0.1909744  0.27340760  0.4760738  0.1062354  0.5162916
## high     -0.48724019  1.0172418 -0.02626030  1.1040085 -0.3780642  0.8451013
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9131261 -0.6963644 -0.7203480 -0.51863814  0.37560727 -0.79809304
## med_low   0.4464244 -0.5431568 -0.4956009 -0.08107991  0.31516853 -0.12828110
## med_high -0.4382173 -0.4383393 -0.3310346 -0.39243158  0.03770318  0.06804964
## high     -0.8755770  1.6368728  1.5131579  0.77931510 -0.80779200  0.91531375
##                  medv
## low       0.566096237
## med_low   0.006001202
## med_high  0.178044164
## high     -0.705441038
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10193276  0.48949553 -1.03333271
## indus    0.02553774 -0.11852977  0.47768899
## chas    -0.08387932 -0.08942036 -0.04284944
## nox      0.32839483 -0.83499412 -1.13292343
## rm      -0.10698896 -0.03747740 -0.16697735
## age      0.25393988 -0.44380502 -0.15471613
## dis     -0.10685689 -0.14393142  0.38476532
## rad      3.34535600  0.93596279  0.08415850
## tax      0.03460748  0.13738182  0.24412744
## ptratio  0.10861001 -0.03893911 -0.26763785
## black   -0.11109600  0.07087897  0.13790992
## lstat    0.24606490 -0.16787820  0.39551965
## medv     0.19013972 -0.36910548 -0.03833447
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9434 0.0431 0.0135
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)


From the results plot we can see that rad seems to be the most driving variable in our training set for explaining LD1. From the result object we can see that it has the largest positive LD1 coefficient, 3.345, whereas the second most effective variable seems to be nox with only 0.328. For LD2, zn has the highest positive coefficient, 0.489.

Predict the classes with the LDA model on the test data

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
##   [1] med_low  med_low  med_low  med_high med_high med_low  low      med_low 
##   [9] low      med_low  med_low  med_low  low      low      med_low  med_low 
##  [17] low      med_high med_high med_high med_low  med_low  med_low  med_high
##  [25] med_high med_high med_high med_high med_high med_high med_high med_low 
##  [33] med_high med_low  med_high med_low  med_low  med_low  med_high med_high
##  [41] med_low  med_high med_high med_low  med_low  low      med_low  low     
##  [49] low      low      med_low  low      low      med_low  med_high med_low 
##  [57] med_low  med_low  med_low  med_low  med_low  med_low  med_low  low     
##  [65] high     high     high     high     high     high     high     high    
##  [73] high     high     high     high     high     high     high     high    
##  [81] high     high     high     high     high     high     high     high    
##  [89] high     high     high     high     high     high     high     high    
##  [97] high     med_high med_high med_high med_high med_high
## Levels: low med_low med_high high
## 
## $posterior
##              low      med_low     med_high         high
## 6   3.545608e-01 5.927212e-01 5.271798e-02 1.561903e-21
## 12  9.491999e-02 7.026950e-01 2.023850e-01 6.229562e-17
## 20  1.681783e-01 5.187549e-01 3.130667e-01 2.326886e-17
## 23  5.621475e-02 4.270415e-01 5.167437e-01 2.331332e-16
## 27  6.515744e-02 4.372181e-01 4.976244e-01 1.227103e-16
## 36  2.082115e-01 6.233536e-01 1.684349e-01 6.128147e-17
## 40  9.913195e-01 8.255192e-03 4.252916e-04 1.356801e-22
## 46  2.705772e-01 7.225214e-01 6.901376e-03 3.041111e-22
## 57  9.906925e-01 9.170553e-03 1.369043e-04 8.273185e-25
## 71  2.781487e-01 7.196751e-01 2.176198e-03 2.768554e-22
## 72  1.792581e-01 8.177273e-01 3.014606e-03 1.994555e-21
## 80  1.149919e-01 8.763678e-01 8.640270e-03 4.307319e-19
## 81  7.305531e-01 2.615256e-01 7.921303e-03 6.620525e-21
## 83  6.991652e-01 2.945936e-01 6.241264e-03 8.952738e-21
## 87  2.585385e-01 7.212588e-01 2.020264e-02 4.067467e-21
## 88  3.727484e-01 6.005606e-01 2.669109e-02 3.564293e-21
## 96  4.898678e-01 4.799401e-01 3.019210e-02 1.469722e-22
## 102 1.528741e-01 4.080003e-01 4.391256e-01 2.961440e-16
## 108 7.491346e-02 4.149658e-01 5.101207e-01 5.793069e-15
## 114 2.155002e-02 3.821305e-01 5.963195e-01 1.826666e-13
## 119 4.019327e-02 4.956098e-01 4.641969e-01 9.078849e-14
## 120 6.486443e-02 6.326988e-01 3.024368e-01 1.731433e-14
## 121 5.200263e-02 5.227491e-01 4.252483e-01 2.261051e-19
## 128 8.285731e-03 1.413931e-01 8.503212e-01 7.487762e-15
## 129 8.953100e-03 1.068463e-01 8.842006e-01 2.449974e-15
## 133 8.001254e-03 8.284981e-02 9.091489e-01 1.298947e-15
## 139 6.565713e-03 1.488292e-01 8.446051e-01 1.433110e-14
## 141 5.609179e-03 1.492105e-01 8.451803e-01 1.740247e-14
## 153 2.384037e-04 1.443389e-03 9.983182e-01 5.043100e-15
## 165 1.402392e-02 3.276810e-01 6.582951e-01 2.758154e-15
## 169 1.006851e-02 1.904141e-01 7.995173e-01 5.318824e-15
## 182 2.496288e-01 4.894343e-01 2.609369e-01 3.068778e-19
## 184 2.178514e-01 3.358899e-01 4.462586e-01 3.272419e-19
## 185 1.292536e-01 5.344368e-01 3.363097e-01 2.739780e-18
## 187 2.767939e-01 3.055358e-01 4.176703e-01 3.254375e-20
## 206 2.277464e-01 7.506466e-01 2.160702e-02 5.888956e-20
## 211 4.705169e-02 6.121853e-01 3.407631e-01 3.531202e-18
## 216 2.015500e-01 7.414937e-01 5.695629e-02 2.338104e-19
## 220 3.861055e-02 3.345582e-01 6.268313e-01 2.057364e-17
## 225 2.026397e-02 1.191908e-01 8.605453e-01 7.185009e-13
## 230 2.122772e-01 6.494321e-01 1.382908e-01 1.963554e-14
## 232 5.373174e-02 3.229284e-01 6.233399e-01 4.229971e-13
## 233 3.844782e-02 1.813244e-01 7.802277e-01 1.574827e-13
## 236 5.719152e-02 6.099055e-01 3.329029e-01 1.262540e-12
## 247 2.218334e-01 7.554230e-01 2.274357e-02 3.280553e-16
## 252 4.994588e-01 4.939139e-01 6.627355e-03 8.174598e-18
## 254 3.812367e-01 5.850111e-01 3.375229e-02 8.907098e-18
## 277 8.397256e-01 1.305080e-01 2.976638e-02 4.775162e-21
## 289 7.577569e-01 2.370850e-01 5.158161e-03 5.130903e-18
## 290 7.729679e-01 2.239995e-01 3.032629e-03 1.983750e-18
## 295 8.046844e-02 9.131106e-01 6.420987e-03 4.420102e-21
## 303 5.385572e-01 4.517922e-01 9.650616e-03 1.111016e-16
## 306 5.565299e-01 2.419890e-01 2.014811e-01 6.407498e-14
## 311 2.231956e-01 6.896357e-01 8.716872e-02 5.573166e-18
## 313 9.103693e-02 4.354239e-01 4.735392e-01 5.016534e-17
## 316 1.248204e-01 6.142727e-01 2.609069e-01 1.017406e-17
## 318 7.159943e-02 6.251157e-01 3.032848e-01 3.248314e-17
## 319 1.488183e-01 5.353648e-01 3.158168e-01 5.737010e-18
## 320 1.371898e-01 6.456615e-01 2.171487e-01 4.527459e-18
## 327 2.988617e-01 6.623244e-01 3.881386e-02 4.408365e-19
## 331 2.568054e-01 7.359374e-01 7.257124e-03 3.632063e-21
## 336 2.832175e-01 6.326759e-01 8.410662e-02 2.024916e-18
## 338 1.755426e-01 6.377165e-01 1.867409e-01 2.801801e-17
## 353 8.956002e-01 1.040288e-01 3.710187e-04 2.403038e-22
## 361 2.302845e-20 2.072214e-17 3.178369e-13 1.000000e+00
## 371 4.047307e-19 5.575995e-16 6.333154e-12 1.000000e+00
## 384 4.143297e-22 2.414889e-18 4.747589e-15 1.000000e+00
## 385 1.010339e-23 1.329848e-19 2.002230e-16 1.000000e+00
## 387 4.048151e-23 4.537602e-19 7.147343e-16 1.000000e+00
## 389 2.051357e-23 2.787188e-19 4.942603e-16 1.000000e+00
## 390 1.900087e-21 7.571251e-18 1.212872e-14 1.000000e+00
## 394 7.123624e-20 1.163124e-16 1.575695e-13 1.000000e+00
## 395 3.012242e-20 6.169816e-17 8.012261e-14 1.000000e+00
## 396 3.189120e-20 6.086197e-17 1.080342e-13 1.000000e+00
## 407 1.713553e-22 1.523402e-18 1.050106e-15 1.000000e+00
## 414 1.044589e-21 7.118894e-18 3.340270e-15 1.000000e+00
## 415 1.069990e-25 3.003396e-21 1.070757e-17 1.000000e+00
## 418 4.878765e-23 3.319497e-19 6.394369e-16 1.000000e+00
## 429 2.422235e-21 6.203036e-18 9.649934e-15 1.000000e+00
## 431 3.430815e-20 9.921709e-17 3.177073e-14 1.000000e+00
## 433 7.760851e-19 1.097494e-15 2.287329e-13 1.000000e+00
## 446 2.600052e-23 8.411001e-20 1.378304e-15 1.000000e+00
## 448 1.257565e-20 2.156357e-17 9.012200e-14 1.000000e+00
## 454 3.161428e-20 5.547854e-17 2.575826e-13 1.000000e+00
## 455 2.459428e-22 5.500760e-19 6.360770e-15 1.000000e+00
## 461 1.059354e-20 1.960689e-17 8.028836e-14 1.000000e+00
## 462 5.137050e-20 9.174659e-17 2.170510e-13 1.000000e+00
## 467 8.843799e-22 3.118413e-18 7.360871e-15 1.000000e+00
## 469 9.238297e-19 4.185244e-15 2.842024e-13 1.000000e+00
## 470 9.554179e-18 2.706658e-14 9.840723e-13 1.000000e+00
## 472 7.855117e-18 2.888815e-14 1.214264e-12 1.000000e+00
## 473 2.966860e-18 9.145536e-15 9.444441e-13 1.000000e+00
## 474 2.362004e-18 4.318296e-15 1.502068e-12 1.000000e+00
## 475 8.701670e-20 4.637081e-16 5.807108e-14 1.000000e+00
## 482 2.190386e-16 3.764827e-13 1.217111e-11 1.000000e+00
## 484 9.429037e-16 2.171413e-12 1.333873e-11 1.000000e+00
## 488 5.085316e-17 9.540460e-14 3.306251e-12 1.000000e+00
## 493 1.481666e-02 4.619755e-01 5.232079e-01 1.434508e-15
## 495 7.694684e-02 4.219289e-01 5.011243e-01 2.924647e-14
## 500 3.799992e-02 3.170623e-01 6.449378e-01 1.702051e-13
## 501 3.929464e-02 2.728570e-01 6.878484e-01 9.605247e-14
## 504 2.721136e-01 1.824734e-01 5.454130e-01 1.931744e-21
## 
## $x
##            LD1          LD2          LD3
## 6   -3.1327775 -0.257375162  0.596864160
## 12  -1.9130092 -0.382788193  0.863447322
## 20  -2.0514370 -0.503081035  0.147046596
## 23  -1.7364077 -0.937449113  0.392935314
## 27  -1.8161454 -0.925832373  0.355017021
## 36  -1.9507867 -0.013881109  0.364116703
## 40  -3.2145965  2.107239116 -2.299247518
## 46  -3.2687373  0.516400534  1.598007244
## 57  -3.7641389  2.190272101 -1.744368690
## 71  -3.2589168  1.096058040  1.940365648
## 72  -3.0256839  0.969300064  2.164009210
## 80  -2.4256449  0.800260096  2.040032997
## 81  -2.9469469  1.056857796  0.005317303
## 83  -2.9108408  1.194830711  0.206004917
## 87  -2.9979631  0.210289858  1.233635317
## 88  -3.0315505  0.181918937  0.775129707
## 96  -3.3953594 -0.097471354  0.444456475
## 102 -1.7608557 -0.473536924 -0.173920310
## 108 -1.3939315 -0.519897011  0.140127927
## 114 -0.9407070 -0.720321736  0.656723094
## 119 -1.0580184 -0.425208181  0.628319383
## 120 -1.2698356 -0.185819508  0.743692594
## 121 -2.5072115 -1.525417114  0.825376316
## 128 -1.2127657 -1.594476004  0.260135767
## 129 -1.3317508 -1.705844905 -0.026181019
## 133 -1.3872242 -1.832633008 -0.187297633
## 139 -1.1293757 -1.610488687  0.427925666
## 141 -1.0991161 -1.648929073  0.515911695
## 153 -0.8938027 -3.197744872 -1.897891247
## 165 -1.3791521 -1.334710833  0.819677146
## 169 -1.2714128 -1.512850984  0.444135340
## 182 -2.5491250 -0.687486447  0.015803936
## 184 -2.5306548 -1.016339111 -0.415218943
## 185 -2.2770313 -0.838266464  0.346534989
## 187 -2.7958983 -1.122583850 -0.568874937
## 206 -2.6964823  0.389307395  1.265550364
## 211 -2.1977115 -1.179424180  1.037036616
## 216 -2.5541092 -0.009063296  0.993432879
## 220 -1.9800331 -1.414025216  0.377286057
## 225 -0.7491814 -0.846617434 -0.503760157
## 230 -1.3082188  0.648098132  0.331425137
## 232 -0.8929429 -0.339382198 -0.048494397
## 233 -0.9671818 -0.693376251 -0.428265613
## 236 -0.7863959  0.131011642  0.663204838
## 247 -1.7372777  1.182982664  1.090718423
## 252 -2.1542895  1.721036428  0.704428132
## 254 -2.1664386  0.822755758  0.502489021
## 277 -2.9903871  0.380948732 -1.108167836
## 289 -2.1976936  1.919996203 -0.109054255
## 290 -2.2924053  2.099774590  0.013631775
## 295 -2.9107952  0.382123398  2.468927651
## 303 -1.8721666  1.806483924  0.409357347
## 306 -1.2013213  0.877272780 -1.253069391
## 311 -2.2125562  0.115763118  0.670974782
## 313 -1.9331571 -0.866271128  0.194369133
## 316 -2.1295676 -0.591326389  0.544087020
## 318 -1.9729859 -0.755246323  0.806245046
## 319 -2.2016065 -0.684750634  0.271806836
## 320 -2.2232362 -0.540694928  0.609235445
## 327 -2.4942308  0.381484122  0.772733878
## 331 -2.9917056  0.711402248  1.577155901
## 336 -2.3345611  0.118630523  0.489278070
## 338 -2.0310535 -0.201444676  0.466318910
## 353 -3.2357350  2.306321781  0.079086031
## 361  6.5306257 -0.333721019 -1.356578228
## 371  6.1939641 -0.653167297 -0.951105303
## 384  6.9123010  0.228444211  0.306019146
## 385  7.2850409  0.346567472  0.794060720
## 387  7.1387872  0.264033616  0.713799272
## 389  7.2014729  0.181924260  0.776593272
## 390  6.7679001  0.358716646  0.176440949
## 394  6.4175812  0.502565495 -0.223980181
## 395  6.5015019  0.502805728 -0.094523066
## 396  6.4932137  0.372893676 -0.232603361
## 407  7.0067215  0.645884013  0.864230179
## 414  6.8277050  0.786702051  0.864398440
## 415  7.7325023  0.004838518  0.878981765
## 418  7.1419333  0.373411607  0.359437223
## 429  6.7659630  0.552235398 -0.071081492
## 431  6.4941839  1.034825361  0.548764159
## 433  6.1951481  1.277711965  0.324077794
## 446  7.2129491 -0.299987273 -0.761076620
## 448  6.5865384  0.081723765 -0.582246814
## 454  6.4807939 -0.070175910 -0.582135046
## 455  6.9901335 -0.172243383 -0.824084873
## 461  6.6017558  0.073511761 -0.534233264
## 462  6.4384979  0.213331475 -0.351466887
## 467  6.8524292  0.293661274 -0.032682712
## 469  6.1322782  1.291102022  1.366437706
## 470  5.9102891  1.594666539  1.331895627
## 472  5.9148734  1.421407800  1.437073195
## 473  6.0160414  1.144799434  1.032147502
## 474  6.0475619  0.796449171  0.339636682
## 475  6.3744433  1.136832331  1.222588181
## 482  5.5919541  1.581142425  1.137928082
## 484  5.4446081  2.140090805  1.859106565
## 488  5.7481493  1.646538322  1.137690696
## 493 -1.4631837 -1.247224490  1.183849273
## 495 -1.2156945 -0.345019796  0.111699991
## 500 -0.9748089 -0.569848116  0.141133081
## 501 -1.0359850 -0.651676133 -0.021849518
## 504 -3.0950352 -1.556756813 -1.049297428
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}

# call calc_class_err to compute the proportion of wrong predictions in the test data
calc_class_err(actual = correct_classes, predicted = lda.pred$class)
## [1] 0.3529412
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        8      10        2    0
##   med_low    4      13        9    0
##   med_high   0       9       14    2
##   high       0       0        0   31

The results show that the high and low groups are assigned to the right classes much more reliably than the med_high and med_low groups. Overall, the error rate here is about 35 % (it varies somewhat depending on how the observations are divided into train and test data). This is pretty high and could be better.
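Because the error depends on the random train/test split, a more stable estimate can be obtained by repeating the split. A minimal sketch, assuming the scaled data with the crime factor from the earlier chunks is available as boston_scaled (the object name is an assumption):

library(MASS)
set.seed(13)
# repeat the 80/20 split ten times and collect the test error of each LDA fit
errors <- replicate(10, {
  ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
  fit <- lda(crime ~ ., data = boston_scaled[ind, ])
  pred <- predict(fit, newdata = boston_scaled[-ind, ])
  calc_class_err(actual = boston_scaled[-ind, ]$crime, predicted = pred$class)
})
summary(errors)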

Reload the Boston dataset and standardize the dataset

Next we reload the Boston dataset and standardise it, scaling all variables to have mean = 0 and standard deviation = 1, and then calculate distance matrices between the observations.

# Bring data to R
library(MASS)
library(dplyr)
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))
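# Quick sanity check (a sketch): after scaling, every column should have
# mean ~ 0 and standard deviation 1
round(colMeans(standardised_boston), 3)
apply(standardised_boston, 2, sd)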

# Calculate distances

## Euclidean distance matrix
dist_eu <- dist(standardised_boston, method = "euclidean")

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(standardised_boston, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
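# To illustrate the difference between the two metrics (a sketch), the same
# distances computed 'by hand' for the first two observations:
x <- unlist(standardised_boston[1, ])
y <- unlist(standardised_boston[2, ])
sqrt(sum((x - y)^2))  # euclidean: straight-line distance
sum(abs(x - y))       # manhattan: sum of absolute coordinate differences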
set.seed(13)

# K-means clustering, centers = 4
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.269   2.000   4.000
# K-means clustering, centers = 3
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.929   2.000   3.000
# K-means clustering, centers = 2
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.729   2.000   2.000
# K-means clustering, centers = 5
km <- kmeans(Boston, centers = 5)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   2.666   3.000   5.000
# K-means clustering, centers = 6
km <- kmeans(Boston, centers = 6)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.312   6.000   6.000

From the summaries we can see that the two distance metrics differ quite a lot, the Manhattan method giving clearly longer distances: Euclidean distance is the length of the straight line between two observations, whereas Manhattan distance is the sum of the absolute differences along each coordinate. Two clusters seem to work best. With centers = 4 the cluster labels range from 1 to 4 and cluster 2 is the most common; with higher values of centers a middle cluster again dominates. So the summary of the cluster labels alone is a poor tool for selecting a good number of clusters (note also that these k-means runs used the unscaled Boston data). From the pairs plots, however, it looks like four clusters could be present in the data, although at least two and three clusters are also worth trying.
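A more systematic way to choose the number of clusters is to compute the total within-cluster sum of squares (WCSS) for a range of k and look for an 'elbow' where the curve stops dropping sharply. A minimal sketch on the same (unscaled) Boston data:

set.seed(13)
# total WCSS for k = 1..10 clusters
twcss <- sapply(1:10, function(k) kmeans(Boston, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")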

Bonus: k-means on the Boston data with 3 clusters (standardized first)

data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))

# K-means clustering, centers = 3
km <- kmeans(standardised_boston, centers = 3)
# plot the Boston dataset with clusters
pairs(standardised_boston[1:5], col = km$cluster)

pairs(standardised_boston[6:10], col = km$cluster)

pairs(standardised_boston[11:ncol(standardised_boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.743   2.000   3.000
standardised_boston$N_clusters <- factor(km$cluster)

# Build up test and train datasets
# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)

# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404  15
# create test set 
test <- standardised_boston[-ind,]
dim(test)
## [1] 102  15
# save the 'correct' clusters from test data
correct_classes <- test$N_clusters

# remove the N_clusters variable from test data
test <- dplyr::select(test, -N_clusters)
head(test)
##          crim          zn      indus       chas        nox         rm
## 3  -0.4169290 -0.48724019 -0.5927944 -0.2723291 -0.7395304  1.2814456
## 8  -0.4032966  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069
## 10 -0.4003331  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130
## 11 -0.3939564  0.04872402 -0.4761823 -0.2723291 -0.2648919  0.1314594
## 21 -0.2745709 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -1.0171035
## 23 -0.2768170 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044
##           age         dis        rad        tax    ptratio     black      lstat
## 3  -0.2655490 0.556609050 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 8   0.9778406 1.023624897 -0.5224844 -0.5769480 -1.5037485 0.4406159  0.9097999
## 10  0.6154813 1.328320207 -0.5224844 -0.5769480 -1.5037485 0.3289995  0.6227277
## 11  0.9138948 1.211779950 -0.5224844 -0.5769480 -1.5037485 0.3926395  1.0918456
## 21  1.0488914 0.001356935 -0.6373311 -0.6006817  1.1753027 0.2179309  1.1716657
## 23  0.8215287 0.086363887 -0.6373311 -0.6006817  1.1753027 0.4406159  0.8495847
##          medv
## 3   1.3229375
## 8   0.4965904
## 10 -0.3949946
## 11 -0.8190411
## 21 -0.9712629
## 23 -0.7972951
# linear discriminant analysis
lda.fit <- lda(N_clusters ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(N_clusters ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4702970 0.3267327 0.2029703 
## 
## Group means:
##         crim         zn      indus        chas        nox          rm
## 1 -0.3724278 -0.3387499 -0.2829898  0.03849464 -0.3098577 -0.07912173
## 2  0.7707649 -0.4872402  1.1304030  0.05576262  1.1890969 -0.51987167
## 3 -0.4082840  1.5687708 -1.1015857 -0.08027540 -1.0112004  0.88743088
##           age         dis        rad        tax    ptratio      black
## 1  0.02723655  0.01214864 -0.5786988 -0.6052722 -0.1000423  0.2679608
## 2  0.82255145 -0.85370946  1.1680242  1.2695565  0.5468315 -0.5919408
## 3 -1.17655926  1.26178915 -0.6037174 -0.6590027 -0.7421679  0.3540631
##        lstat        medv
## 1 -0.1783253  0.06127032
## 2  0.8633549 -0.69424886
## 3 -0.9375719  0.95206252
## 
## Coefficients of linear discriminants:
##                 LD1          LD2
## crim    -0.05143915  0.128168802
## zn      -0.05754639  1.091843941
## indus    0.49017626  0.071189895
## chas     0.04698341 -0.068524065
## nox      1.06863590  0.719733618
## rm      -0.15043419  0.380911844
## age     -0.12108442 -0.593230736
## dis     -0.12373888  0.564951427
## rad      0.69328145  0.007068937
## tax      0.92825702  0.816155360
## ptratio  0.24143922  0.085061886
## black    0.01470321 -0.013028639
## lstat    0.17568563  0.484228519
## medv    -0.06969142  0.626189685
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8378 0.1622
# reuse the lda.arrows() helper defined above for the biplot arrows

# target classes as numeric
classes <- as.numeric(train$N_clusters)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The plot of the LDA results, together with the coefficient table, shows that nox (1.07) and tax (0.93) have the largest coefficients for LD1, with rad (0.69) close behind, while zn (1.09) and tax (0.82) dominate LD2; these variables are therefore the most influential linear separators of the clusters.

SuperBonus

# Add the crime factor to the data set scaled in the last code chunk

# create a quantile vector of crim and print it
bins <- quantile(standardised_boston[['crim']])
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(standardised_boston[['crim']], breaks = bins, include.lowest = TRUE)

# remove original crim from the dataset
standardised_boston <- dplyr::select(standardised_boston, -crim)

# add the new categorical value to scaled data
standardised_boston <- data.frame(standardised_boston, crime)

# Set crime to character for a while
standardised_boston$crime<-as.character(standardised_boston$crime)

# Set 'crime' to factor variable, and give relevant levels
standardised_boston$crime[standardised_boston$crime=="[-0.419,-0.411]"] <- "low"
standardised_boston$crime[standardised_boston$crime=="(-0.411,-0.39]"] <- "med_low"
standardised_boston$crime[standardised_boston$crime=="(-0.39,0.00739]"] <- "med_high"
standardised_boston$crime[standardised_boston$crime=="(0.00739,9.92]"] <- "high"
# back to factor with correct levels
standardised_boston$crime<- factor(standardised_boston$crime, levels = c("low", "med_low", "med_high", "high"))

# Check head of edited table
head(standardised_boston)
##           zn      indus       chas        nox        rm        age      dis
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##          rad        tax    ptratio     black      lstat       medv N_clusters
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278          1
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239          1
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375          1
## 4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886          3
## 5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323          3
## 6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582          1
##   crime
## 1   low
## 2   low
## 3   low
## 4   low
## 5   low
## 6   low
# Build up test and train datasets

# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)

# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404  15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102  15
# Save the N_clusters column (the k-means cluster labels)
n_clusterit <- train$N_clusters

# remove N_clusters from the train set so it doesn't leak into the LDA predictors
train <- dplyr::select(train, -N_clusters)

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2500000 0.2574257 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.08456771 -0.9369443 -0.11640431 -0.8924742  0.5212608 -0.8766213
## med_low  -0.09115733 -0.3312543  0.03952046 -0.5917217 -0.1115500 -0.3625932
## med_high -0.39571399  0.1510421  0.14409500  0.3846657  0.1429959  0.4294170
## high     -0.48724019  1.0149946 -0.03128211  1.0406720 -0.4855777  0.7957547
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8529692 -0.6987343 -0.7602375 -0.49441664  0.3799540 -0.79526046
## med_low   0.3642410 -0.5554602 -0.5409951 -0.04120056  0.3201782 -0.17644351
## med_high -0.3749755 -0.3822389 -0.2825025 -0.26326606  0.1074278 -0.05752024
## high     -0.8584374  1.6596029  1.5294129  0.80577843 -0.7464259  0.89613928
##                medv
## low       0.5929402
## med_low   0.0102126
## med_high  0.2096063
## high     -0.6231054
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.11030024  0.90477115 -0.9432520
## indus   -0.02369693 -0.21284527  0.3534688
## chas    -0.06681638 -0.04469796  0.1555483
## nox      0.32271479 -0.76681432 -1.3321508
## rm      -0.11928537 -0.08149308 -0.1426609
## age      0.35340567 -0.29789983 -0.2669457
## dis     -0.08563196 -0.47438844  0.2270172
## rad      3.09835255  1.17955576  0.1675980
## tax      0.10830898 -0.36338770  0.2630112
## ptratio  0.15263545  0.06163493 -0.2299144
## black   -0.17731254  0.01064261  0.1035738
## lstat    0.14115142 -0.15376102  0.3960080
## medv     0.18691351 -0.36944033 -0.1857561
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9436 0.0416 0.0148
# Collect crime variables
target<- factor(train$crime)


# Collect predictor data
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Next, access the plotly package. Create a 3D plot of the columns of the matrix.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=target)
# Let's colour-code the points by the k-means clusters instead
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=n_clusterit)

5: Dimensionality reduction techniques

Data wrangling

Data wrangling was started last week, and the script can be found in the file create_human.R. There should be 195 observations and 19 variables. Most of the variable names have been shortened, and a couple of new variables have been added to the original dataset, as follows:

  • “GNI” = Gross National Income per capita
  • “Life.Exp” = Life expectancy at birth
  • “Edu.Exp” = Expected years of schooling
  • “Mat.Mor” = Maternal mortality ratio
  • “Ado.Birth” = Adolescent birth rate
  • “Parli.F” = Percentage of female representatives in parliament
  • “Edu2.F” = Proportion of females with at least secondary education
  • “Edu2.M” = Proportion of males with at least secondary education
  • “Labo.F” = Proportion of females in the labour force
  • “Labo.M” = Proportion of males in the labour force

New variables (a sketch of how they were computed is shown below):

  • “Edu2.FM” = Edu2.F / Edu2.M
  • “Labo.FM” = Labo.F / Labo.M
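For reference, a minimal sketch of how the two ratio variables would be computed in create_human.R (assuming the renamed columns Edu2.F, Edu2.M, Labo.F and Labo.M already exist in a data frame named human; the object name is an assumption):

library(dplyr)
human <- human %>%
  mutate(Edu2.FM = Edu2.F / Edu2.M,
         Labo.FM = Labo.F / Labo.M)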

There is still some editing to do, e.g. the Gender Inequality Index column could also be shortened to ‘GII’. Let’s start the wrangling with that.

# Read file to R-environment
human_data<-read.table("human_data.tsv", header=TRUE, sep=";", dec=",")


# EDIT Gender Inequality Index column
colnames(human_data)[colnames(human_data)=="Gender Inequality Index"]<-"GII"

# Libraries
library(dplyr)

# Change the GNI column to numeric with mutate
human_data$GNI <- human_data %>%
  dplyr::select(GNI) %>%
  mutate(GNI = as.numeric(GNI))


# Exclude unneeded variables
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human_data.subset <- human_data[,keep]

# Editing 'GNI' the previous way turns the column into a data frame, which would
# be problematic in plots, so let's change it back to a plain numeric vector
human_data.subset$GNI <- as.numeric(unlist(human_data.subset$GNI))

# Remove rows with all NA
human_data.subset<-human_data.subset[rowSums(is.na(human_data.subset))!= ncol(human_data.subset),]
dim(human_data.subset)
## [1] 195   9
# Seems that there aren't any all-NA rows
# Let's remove rows that contain even one NA
human_data.subset<-human_data.subset[complete.cases(as.matrix(human_data.subset)), ]
dim(human_data.subset)
## [1] 162   9
# After subsetting, 162/195 observations remain; these have no missing values

# Next we should remove the observations that refer to regions rather than countries;
# to find out which ones those are, let's print the unique values of the Country column

paste(unique(human_data.subset$Country),collapse=', ')
## [1] "Afghanistan, Albania, Algeria, Arab States, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia (Plurinational State of), Bosnia and Herzegovina, Botswana, Brazil, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Central African Republic, Chad, Chile, China, Colombia, Congo, Congo (Democratic Republic of the), Costa Rica, Côte d'Ivoire, Croatia, Cuba, Cyprus, Czech Republic, Denmark, Dominican Republic, East Asia and the Pacific, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Europe and Central Asia, Fiji, Finland, France, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Guatemala, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran (Islamic Republic of), Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Korea (Republic of), Kuwait, Kyrgyzstan, Latin America and the Caribbean, Latvia, Lebanon, Lesotho, Liberia, Libya, Lithuania, Luxembourg, Malawi, Malaysia, Maldives, Mali, Malta, Mauritania, Mauritius, Mexico, Moldova (Republic of), Mongolia, Montenegro, Morocco, Mozambique, Myanmar, Namibia, Nepal, Netherlands, New Zealand, Nicaragua, Niger, Norway, Oman, Pakistan, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russian Federation, Rwanda, Samoa, Saudi Arabia, Senegal, Serbia, Sierra Leone, Singapore, Slovakia, Slovenia, South Africa, South Asia, Spain, Sri Lanka, Sub-Saharan Africa, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syrian Arab Republic, Tajikistan, Tanzania (United Republic of), Thailand, The former Yugoslav Republic of Macedonia, Togo, Tonga, Trinidad and Tobago, Tunisia, Turkey, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Venezuela (Bolivarian Republic of), Viet Nam, World, Yemen, Zambia, Zimbabwe"
# From the print, pick out the rows that are regions rather than countries.
# NB: "South Africa" is actually a country and the region "Arab States" is missing
# from this list, so the head() of the final data below still shows Arab States
regions<-c("South Africa","South Asia","Latin America and the Caribbean","Europe and Central Asia","East Asia and the Pacific","World","Sub-Saharan Africa")

# Subset regions from the human_data
human_data.subset <- subset(human_data.subset, !(Country %in% regions))

# We can check from the dimensions that the correct rows have been removed
dim(human_data.subset)
## [1] 155   9
# add country to rownames
rownames(human_data.subset) <- human_data.subset$Country

human_data.subset <- human_data.subset[,!(colnames(human_data.subset) %in% c('Country'))]

# Check dimensions, should be 155 x 8
dim(human_data.subset)
## [1] 155   8
# Save the wrangled data (note: write.csv2 writes a semicolon-separated file, despite the .tsv extension)
write.csv2(human_data.subset, "human_data_30112022.tsv", row.names=TRUE)

Analysis

This week's topic is dimensionality reduction techniques.

Summary of the data

# Libraries
library(tidyverse)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)
library(MASS)
library(ggplot2)
library(GGally)
library(reshape2)
library(tidyr)
library(corrplot)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
# Let's try a new package for producing a summary table and plot of the data
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.2.2
# head of data
head(human_data.subset)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Afghanistan 0.1987421 0.1979866     9.3     60.4  1885     400      86.8
## Albania     0.6854962 0.9306030    11.8     77.8  9943      21      15.3
## Algeria     0.2105263 0.8612903    14.0     74.8 13054      89      10.0
## Arab States 0.3081009 0.7289916    12.0     70.6 15722     155      45.4
## Argentina   0.6333333 0.9774306    17.9     76.3 22050      69      54.4
## Armenia     0.7465565 0.9894737    12.3     74.7  8124      29      27.1
##             Parli.F
## Afghanistan    27.6
## Albania        20.7
## Algeria        25.7
## Arab States    14.0
## Argentina      36.8
## Armenia        10.7
# Summary of the Human Data
datasummary_skim(human_data.subset)
            Unique (#)  Missing (%)     Mean       SD    Min   Median       Max
Edu2.FM            155            0      0.7      0.2    0.2      0.8       1.0
Labo.FM            151            0      0.9      0.2    0.2      0.9       1.5
Edu.Exp             85            0     13.2      2.8    5.4     13.5      20.2
Life.Exp           112            0     71.7      8.3   49.0     74.2      83.5
GNI                155            0  17651.1  18539.2  581.0  12040.0  123124.0
Mat.Mor             88            0    149.2    211.8    1.0     49.0    1100.0
Ado.Birth          139            0     47.1     41.1    0.6     33.6     204.8
Parli.F            123            0     20.7     11.4    0.0     19.3      57.5
# Summary plot
## Scatterplot
ggpairs(human_data.subset, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 

# Correlation plot
datasummary_correlation(human_data.subset)
           Edu2.FM  Labo.FM  Edu.Exp  Life.Exp   GNI  Mat.Mor  Ado.Birth  Parli.F
Edu2.FM          1        .        .         .     .        .          .        .
Labo.FM        .02        1        .         .     .        .          .        .
Edu.Exp        .05      .59        1         .     .        .          .        .
Life.Exp      -.14      .59      .80         1     .        .          .        .
GNI           -.02      .43      .62       .63     1        .          .        .
Mat.Mor        .24     -.66     -.74      -.87  -.50        1          .        .
Ado.Birth      .12     -.53     -.70      -.74  -.56      .76          1        .
Parli.F        .26      .08      .21       .19   .09     -.09       -.07        1

There are differences between the variables in both scale and distribution. For example, our new variables combine information from two columns into ratios, whereas GNI has the highest values and the widest range. Overall, the values run from 0 up to 123124, which means that some sort of scaling or normalization should be done to make the variables comparable.

PCA raw data

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data. (wikipedia)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_data.subset)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (the warning above is repeated once for each zero-length arrow)

From the plot, we can’t see much or use it like that for good interpretation. As far, we can say that GNI seems to influence PC1.

PCA standardized data

# standardize the variables
human_std <- scale(human_data.subset)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)

After scaling we can see much more clearly how the different variables influence PC1 and PC2. Edu.Exp, Life.Exp, Labo.FM and GNI form one group pulling PC1 in one direction, while Mat.Mor and Ado.Birth form an opposite group with a similar but reversed influence on PC1. Reading the plot, we can see, for example, that Niger and Sierra Leone have high Mat.Mor and Ado.Birth and low GNI, Labo.FM, Life.Exp and Edu.Exp, which sounds realistic; for countries such as Japan, Korea and the Czech Republic the values lie on the opposite side of the mean.
There is likely negative correlation between these two groups of variables and positive correlation within each group. This could be studied further, but we can already confirm it from the earlier summary plots, where the correlation values can be read. Overall, these groups of similarly behaving variables divide the countries into groups with a similar standard of living. PC2 is influenced by Parli.F and Edu2.FM, which also correlate positively with each other: countries such as Bolivia, Rwanda and Namibia seem to have a similar education level between genders, and this correlates positively with the percentage of female representatives in parliament. At the other extreme, Jordan shows a large gender gap in education together with a low percentage of female representatives in parliament.
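To make the biplot easier to read, the axes can be labelled with the percentage of variance each principal component explains. A minimal sketch using the summary of the prcomp object:

# percentages of variance explained by each principal component
s <- summary(pca_human)
pca_pr <- round(100 * s$importance[2, ], digits = 1)
# use the first two percentages as axis labels in the biplot
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, xlab = pc_lab[1], ylab = pc_lab[2])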

Multiple Correspondence Analysis (MCA)

In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.(wikipedia)

# Bring Tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Study data
View(tea)
# Dimensions
dim(tea)
## [1] 300  36
# Colnames
colnames(tea)
##  [1] "breakfast"        "tea.time"         "evening"          "lunch"           
##  [5] "dinner"           "always"           "home"             "work"            
##  [9] "tearoom"          "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"            "how"             
## [17] "where"            "price"            "age"              "sex"             
## [21] "SPC"              "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"          "diuretic"        
## [29] "friendliness"     "iron.absorption"  "feminine"         "sophisticated"   
## [33] "slimming"         "exciting"         "relaxing"         "effect.on.health"
# Summary
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# Summary plot
# Let's visualize part of the data
keep_columns_2<- c("price","Tea")
# Mosaic plot of couple variables
mosaicplot(table(tea[,colnames(tea) %in% keep_columns_2]), xlab='Tea', ylab='Price',main='Tea and price', col='blue')

# Age is continuous, so for MCA it has to be binned into a factor or removed --> remove it here (an alternative binning sketch follows below)
tea<-tea[!(colnames(tea) %in% 'age')]
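# Alternative (a sketch): instead of dropping age, it could first be binned into
# a factor, e.g.
#   tea$age <- cut(tea$age, breaks = c(15, 25, 35, 45, 60, 90), include.lowest = TRUE)
# (the dataset already contains such a binned version as 'age_Q')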

# Multiple Correspondence Analysis (MCA)
# MCA() could also be set to visualize the correlations between the variables and the MCA principal dimensions
res.mca <- MCA(tea, graph = FALSE)
# With this many variables the printed MCA factor plots become hard to read.
# The MCA result object can be used for many different kinds of visualizations -->
# extract and save the results for later use
var <- get_mca_var(res.mca)
var
## Multiple Correspondence Analysis Results for variables
##  ===================================================
##   Name       Description                  
## 1 "$coord"   "Coordinates for categories" 
## 2 "$cos2"    "Cos2 for categories"        
## 3 "$contrib" "contributions of categories"
# Coordinates
head(var$coord)
##                    Dim 1        Dim 2       Dim 3       Dim 4       Dim 5
## breakfast      0.1818793  0.019872372 -0.10740185 -0.55341901 -0.09826250
## Not.breakfast -0.1678886 -0.018343728  0.09914017  0.51084831  0.09070384
## Not.tea time  -0.5562500  0.004261602  0.06242859  0.09798477  0.16970005
## tea time       0.4311760 -0.003303372 -0.04839139 -0.07595269 -0.13154265
## evening        0.2760859 -0.408578586  0.34396723  0.23142519 -0.26420301
## Not.evening   -0.1443495  0.213622307 -0.17984073 -0.12099896  0.13813660
# Cos2: quality on the factor map
head(var$cos2)
##                    Dim 1        Dim 2       Dim 3       Dim 4       Dim 5
## breakfast     0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.breakfast 0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.tea time  0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## tea time      0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## evening       0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
## Not.evening   0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
# Contributions to the principal components
head(var$contrib)     
##                   Dim 1        Dim 2      Dim 3     Dim 4     Dim 5
## breakfast     0.5036943 0.0066328082 0.22530472 6.7101105 0.2373656
## Not.breakfast 0.4649486 0.0061225922 0.20797359 6.1939482 0.2191067
## Not.tea time  4.2859707 0.0002774934 0.06925046 0.1913584 0.6440432
## tea time      3.3222613 0.0002150984 0.05367935 0.1483310 0.4992288
## evening       0.8301633 2.0055059548 1.65293457 0.8393006 1.2274180
## Not.evening   0.4340448 1.0485640271 0.86422467 0.4388221 0.6417465
# visualize MCA - many possibilites

# Eigenvalues/Variances
# In linear algebra, an eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue, often denoted by λ, is the factor by which the eigenvector is scaled.
# Geometrically, an eigenvector, corresponding to a real nonzero eigenvalue, points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed. Loosely speaking, in a multidimensional vector space, the eigenvector is not rotated. (wikipedia)

# get values
eig.val <- get_eigenvalue(res.mca)
# The data frame also contains variance % and cumulative variance %. Those can be used to visualize how much of the variance is explained by each dimension, e.g. in a scree plot
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.09006849         5.837772                    5.837772
## Dim.2 0.08165357         5.292361                   11.130133
## Dim.3 0.07021443         4.550936                   15.681069
## Dim.4 0.06259673         4.057196                   19.738264
## Dim.5 0.05578674         3.615807                   23.354071
## Dim.6 0.05345502         3.464677                   26.818748
# Build plot
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
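# The cumulative variance percentages above can be plotted in the same spirit (a sketch):
plot(eig.val[, "cumulative.variance.percent"], type = "b",
     xlab = "dimension", ylab = "cumulative variance (%)")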

# Biplot
# The plot shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.
# The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with a similar profile are close to each other on the factor map. The same holds true for column points.
fviz_mca_biplot(res.mca, 
               repel = TRUE,
               ggtheme = theme_minimal())
## Warning: ggrepel: 279 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 59 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Contribution of variable categories to the dimensions 1-2
# The red dashed line on the graph indicates the expected average value if the contributions were uniform.
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)

# Cos2
# The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. 
# Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Contribution
# The plot above gives an idea of what pole of the dimensions the categories are actually contributing to. 
# Color by contribution values: quality on the factor map
fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# From the plots we can read many things and make assumptions, maybe even conclusions, although more testing would be needed for that. We can see, e.g., that tea shop, non-working and older age groups seem to have a positive influence on dim 2, whereas students, who correlate positively with the 15-24 age group, have a negative influence on dim 2. Could this indicate age- and civil-status-related differences in tea drinking habits? Also, the place where tea is drunk seems to correlate with lunch-time tea consumption along dim 1. Could this indicate that workers like to drink their lunch tea in a pub, chain store + tea shop or tearoom rather than at home?

6: Analysis of longitudinal data

RATS Data analysis

Bring data

# Bring data to R and check summaries
ratsl=read.table('ratsl.txt', header=TRUE)
str(ratsl)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group         : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ time_variables: chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ rats          : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time_variable : int  1 1 1 1 1 1 1 1 1 1 ...
ratsl$Group = factor(ratsl$Group)
ratsl$ID = factor(ratsl$ID)
str(ratsl)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID            : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group         : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ time_variables: chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ rats          : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time_variable : int  1 1 1 1 1 1 1 1 1 1 ...

Graphical Displays of Longitudinal Data

# Needed R-libraries
library(dplyr);library(tidyr);library(ggplot2);library(rstatix)

# Draw the plot
ggplot(ratsl, aes(x = time_variable , y = rats, linetype = ID, color= ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(ratsl$rats), max(ratsl$rats)))

# Standardize the data (per time point) to see how it affects the plots

ratsl <- ratsl %>%
  group_by(time_variable) %>%
  mutate( stdratsl = (rats - mean(rats))/sd(rats) ) %>%
  ungroup()

# Check edited data frame
str(ratsl)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
##  $ ID            : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group         : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ time_variables: chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ rats          : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ time_variable : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stdratsl      : num [1:176] -1.001 -1.12 -0.961 -0.842 -0.882 ...
glimpse(ratsl)
## Rows: 176
## Columns: 6
## $ ID             <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ Group          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1…
## $ time_variables <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ rats           <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, …
## $ time_variable  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 15, 15,…
## $ stdratsl       <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819…
# Build a new ggplot using the standardised values
ggplot(ratsl, aes(x = time_variable, y = stdratsl, linetype = ID, color = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(ratsl$stdratsl), max(ratsl$stdratsl)))

# Summary data with mean and standard error of rats by Group and time point
ratss <- ratsl %>%
  group_by(Group, time_variable) %>%
  summarise( mean = mean(rats), se = sd(rats)/sqrt(n()) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
ggplot(ratss, aes(x = time_variable, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(ratss) +/- se(ratss)")

# Create a summary data by Group and subject with mean as the summary variable (ignoring baseline time variable 1)
ratsl8s <- ratsl %>%
  filter(time_variable > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(rats) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(ratsl8s)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# Draw a boxplot of the mean versus Group
ggplot(ratsl8s, aes(x = Group, y = mean, group=Group, color=Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(rats), time points 8-64")

# Let's label the outliers so that they can easily be located and removed
## Function for finding those
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

ratsl8s %>%
  group_by(Group) %>%
  mutate(outlier = ifelse(is_outlier(mean), mean, as.numeric(NA))) %>%
  ggplot(., aes(x = Group, y = mean, group=Group, color=Group)) +
    geom_boxplot() +
    geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3) +
    stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
    scale_y_continuous(name = "mean(rats), time points 8-64")

# Create a new data by filtering the outlier 
## From the plot we can read that 238.9 , 594 & 495.2 are outliers
ratsl8s_wo_O <- subset(ratsl8s,!(mean %in% c(238.9,594,495.2)))

# Let's run the same function with ggplot to confirm that no outliers remain
ratsl8s_wo_O %>%
  group_by(Group) %>%
  mutate(outlier = ifelse(is_outlier(mean), mean, as.numeric(NA))) %>%
  ggplot(., aes(x = Group, y = mean, group=Group, color=Group)) +
    geom_boxplot() +
    geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3) +
    stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
    scale_y_continuous(name = "mean(rats), time points 8-64")

# Now the boxplots look better, so we can run a statistical test
## Because we have more than two groups, we use pairwise t-tests with the Bonferroni p.adjust.method
pairwise_t_test(mean ~ Group, data=ratsl8s_wo_O, p.adjust.method = "bonferroni")
## # A tibble: 3 × 9
##   .y.   group1 group2    n1    n2        p p.signif    p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>    <dbl> <chr>       <dbl> <chr>       
## 1 mean  1      2          7     3 3.99e-13 ****     1.2 e-12 ****        
## 2 mean  1      3          7     3 8.71e-15 ****     2.61e-14 ****        
## 3 mean  2      3          3     3 3.86e- 9 ****     1.16e- 8 ****
# From the result table we can see that there are significant differences between all groups


# Add the baseline from the original data as a new variable to the summary data
rats <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

ratsl8s_2 <- ratsl8s %>%
  mutate(baseline = rats$WD1)
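Note that mutate(baseline = rats$WD1) relies on the row order of ratsl8s matching the original rats data. A sketch of a keyed merge that avoids this assumption (assuming rats has an ID column, as in the MABS source file):

# Sketch: merge the baseline by ID instead of relying on row order
ratsl8s_2 <- ratsl8s %>%
  left_join(rats %>% mutate(ID = factor(ID)) %>% select(ID, baseline = WD1),
            by = "ID")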

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = ratsl8s_2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = ratsl8s_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516   0.1554    
## baseline     0.92513    0.08572  10.793 1.56e-07 ***
## Group2      34.85753   18.82308   1.852   0.0888 .  
## Group3      23.67526   23.25324   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 1.989e-13
# From the linear model summary we can see that the baseline is by far the best predictor of the mean; after adjusting for it, the group effects are not significant at the 5 % level
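To test the group effect conditional on the baseline more formally, we could follow up with a sequential analysis-of-variance table on the fitted model:

# Sequential ANOVA table: does Group add anything once baseline is accounted for?
anova(fit)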

BPRS data analysis

A number of graphical displays that can be useful in the preliminary assessment of longitudinal data from clinical trials will now be illustrated using the data shown below, taken from Davis (2002). Here 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

Bring data

# Bring data to R
bprsl <- read.table('BPRSL.txt', header = TRUE)
str(bprsl)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
bprsl$treatment <- factor(bprsl$treatment)
bprsl$subject <- factor(bprsl$subject)

str(bprsl)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
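BPRSL.txt already holds the long-format data created in the data wrangling part. For reference, a sketch of that wide-to-long conversion with tidyr (the URL, separator and week0-week8 column names are assumptions by analogy with the rats data above). Note also that subject is numbered 1-20 within each treatment arm, so the 40 individuals share 20 factor levels.

# Sketch: convert the original wide BPRS data to long form (source assumed)
library(dplyr); library(tidyr)
bprs <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
                   header = TRUE, sep = " ")
bprsl <- bprs %>%
  pivot_longer(cols = starts_with("week"), names_to = "weeks", values_to = "bprs") %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))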

Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables

# Needed R-libraries
library(dplyr);library(tidyr);library(ggplot2);library(rstatix);library(lme4)
## Warning: package 'lme4' was built under R version 4.2.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.2
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Plot the BPRSL data
ggplot(bprsl, aes(x = week, y = bprs, color = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")

# Create a regression model bprsl_reg
# Fit a linear model with bprs as the response and treatment and week as predictors
bprsl_reg <- lm(bprs ~ treatment + week, data = bprsl)
summary(bprsl_reg)
## 
## Call:
## lm(formula = bprs ~ treatment + week, data = bprsl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
# Week, i.e. time, appears to be a clearly better predictor than treatment

# Let's try another model
# Create a random intercept model that accounts for the subject-specific (patient-level) random effect
bprsl_ref <- lmer(bprs ~ treatment + week + (1 | subject), data = bprsl, REML = FALSE)
summary(bprsl_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment + week + (1 | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## treatment2    0.5722     1.0761   0.532
## week         -2.2704     0.2084 -10.896
## 
## Correlation of Fixed Effects:
##            (Intr) trtmn2
## treatment2 -0.282       
## week       -0.437  0.000
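The variance components also let us quantify how much of the total variation is due to between-subject differences, i.e. the intraclass correlation:

# Intraclass correlation from the random intercept model's variance components
vc <- as.data.frame(VarCorr(bprsl_ref))
vc$vcov[1] / sum(vc$vcov)  # 47.41 / (47.41 + 104.21), roughly 0.31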
# The time trend may also vary from subject to subject
# Create a random intercept and random slope model
bprsl_ref1 <- lmer(bprs ~ treatment + week + (week | subject), data = bprsl, REML = FALSE)
summary(bprsl_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment + week + (week | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## treatment2    0.5722     1.0405   0.550
## week         -2.2704     0.2977  -7.626
## 
## Correlation of Fixed Effects:
##            (Intr) trtmn2
## treatment2 -0.247       
## week       -0.582  0.000
# The two models can be compared with a likelihood ratio test
# Perform an ANOVA test on the two models
anova(bprsl_ref, bprsl_ref1)
## Data: bprsl
## Models:
## bprsl_ref: bprs ~ treatment + week + (1 | subject)
## bprsl_ref1: bprs ~ treatment + week + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprsl_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## bprsl_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Based on the test (p = 0.026), the random intercept and slope model is the better one

# create a random intercept and random slope model with the interaction
bprsl_ref2 <- lmer(bprs ~ treatment * week + (week | subject), data = bprsl, REML = FALSE)
summary(bprsl_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ treatment * week + (week | subject)
##    Data: bprsl
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0513 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0049  8.0626        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4702  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## treatment2       -2.2911     1.9090  -1.200
## week             -2.6283     0.3589  -7.323
## treatment2:week   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) trtmn2 week  
## treatment2  -0.424              
## week        -0.650  0.469       
## tretmnt2:wk  0.356 -0.840 -0.559
# Compare the two latest models
# perform an ANOVA test on the two models
anova(bprsl_ref2, bprsl_ref1)
## Data: bprsl
## Models:
## bprsl_ref1: bprs ~ treatment + week + (week | subject)
## bprsl_ref2: bprs ~ treatment * week + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## bprsl_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## bprsl_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
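# The interaction improves the fit only marginally (p = 0.075), so the evidence
# for treatment-specific slopes is weak; bprsl_ref2 is used below for illustration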
# Draw the plot of bprsl with the observed BPRS values
ggplot(bprsl, aes(x = week, y = bprs, colour = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Observed BPRS") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(bprsl_ref2)

# Create a new column fitted to bprsl
bprsl$Fitted_bprs = Fitted

# Draw the plot of bprsl with the fitted BPRS values
ggplot(bprsl, aes(x = week, y = Fitted_bprs, colour = subject)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Observed BPRS") +
  theme(legend.position = "top")

# The fitted values show a clear negative association between week and BPRS in both treatment groups
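As a final check, a sketch that overlays the observed and fitted trajectories in a single figure:

# Sketch: observed (solid) vs fitted (dashed) BPRS trajectories by treatment
ggplot(bprsl, aes(x = week, colour = subject)) +
  geom_line(aes(y = bprs)) +
  geom_line(aes(y = Fitted_bprs), linetype = "dashed") +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "BPRS (observed vs fitted)") +
  theme(legend.position = "top")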



(more chapters to be added similarly as we proceed with the course!)